high quality data
Assessing the Role of Data Quality in Training Bilingual Language Models
Seto, Skyler, ter Hoeve, Maartje, de Seyssel, Maureen, Grangier, David
Bilingual and multilingual language models offer a promising path toward scaling NLP systems across diverse languages and users. However, their performance often varies wildly between languages as prior works show that adding more languages can degrade performance for some languages (such as English), while improving others (typically more data constrained languages). In this work, we investigate causes of these inconsistencies by comparing bilingual and monolingual language models. Our analysis reveals that unequal data quality, not just data quantity, is a major driver of performance degradation in bilingual settings. We propose a simple yet effective data filtering strategy to select higher-quality bilingual training data with only high quality English data. Applied to French, German, and Chinese, our approach improves monolingual performance by 2-4% and reduces bilingual model performance gaps to 1%. These results highlight the overlooked importance of data quality in multilingual pretraining and offer a practical recipe for balancing performance.
Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key
Chen, Yingda, Wang, Xingjun, Huang, Jintao, Mao, Yunlin, Zhang, Daoze, Zhao, Yuze
As large language models rapidly evolve to support longer context, there is a notable disparity in their capability to generate output at greater lengths. A recent study suggests that the primary cause of this imbalance may be the lack of long-output data during alignment training. In light of this observation, attempts have been made to re-align foundation models with data that fills the gap, which results in models capable of generating lengthy output when instructed. In this paper, we explore the impact of data quality in tuning a model for long output, and the possibility of doing so starting from human-aligned (instruct or chat) models. With careful data curation, we show that it is possible to achieve similar performance improvement in our tuned models, with only a small fraction of training data instances and compute. In addition, we assess the generalizability of such approaches by applying our tuning recipes to several models. Our findings suggest that, while capacities for generating long output vary across different models out of the box, our approach of tuning them with high-quality data using light compute consistently yields notable improvement across all models we experimented on. We have made public our curated dataset for tuning long-writing capability, the implementations of model tuning and evaluation, as well as the fine-tuned models, all of which can be openly accessed.
Getting Deep Learning working in the wild: A Data-Centric Course - KDnuggets
Have you been excited by recent high profile deep learning successes, but not sure how to practically keep deep learning models working for your project? We've developed a distilled set of materials on data-centric deep learning approaches, which are often among the most impactful tools to get deep learning models working on new tasks. Data-centric deep learning is a relatively new area and a broad term. For us, being data-centric means taking a different perspective on deep learning that's centered around building and maintaining the datasets which define and evaluate deep learning models. The real-world applications and successes of deep learning systems are growing by the day.
Smart Water: Data Labeling with Active Learning And H2O.ai
Data is the food for AI. For machine learning, or supervised learning, the golden labels are key for the models to recognize the patterns within the data. However, in real-world data, it is usually hard to get a large amount of labeled data, for example for search relevance, news topics, autopilot, etc. Recently, Andrew Ng gave a talk on MLOps: From Model-centric to Data-centric AI, where he mentioned the idea of moving from Big Data to Good Data. Good data is defined consistently and covers the important cases.
Navigate data management challenges to enable AI initiatives
Smart data management is the foundation of organisation-wide usage of Artificial Intelligence. Leading organisations are able to fully leverage the power of Artificial Intelligence and generate value by enabling data professionals to have access to well-organised, high quality data from across the entire organisation. But how can this be achieved? The Deloitte AI Loop (DAIL) provides a framework that mimics the human approach in the space of artificial intelligence. Based on our experience in bringing cognitive solutions to our clients, we have outlined DAIL as a blueprint for all aspects that should be covered in a successful AI solution, as we explained in the introductory blog. This is the second article of the DAIL series, focusing on the SENSE component, consisting of tools, technology and infrastructure to measure, capture and monitor data from business processes, behavior and the environment.
AI Development with Bottos: A Simple Use Case - Bottos - Medium
Bottos will soon offer great opportunities to support the development of Artificial Intelligence, with the most important step being the data and model marketplace. Thanks to the underlying blockchain infrastructure and other tools like smart contracts, users will be able to monetize their efforts to produce, clean, and ultimately sell their data safely and conveniently. Bottos will be a great companion to everyone involved in the development of AI models and programs. Karen is a computer scientist who cares greatly for her grandmother, who is increasingly fragile and in need of assistance. While driving, Karen comes up with an interesting idea about an image and speech recognition system that, with the right development, may help seniors live longer in their own houses, autonomously, without moving to a retirement home, while limiting the use of costly nursing services.
Healthcare's Best Shot At Doing AI Right: Make It Invisible
And it won't replace your radiologist. That stated, I agree with Curtis Langlotz, MD, PhD of Stanford, who stated at RSNA this year that radiologists who use AI will replace radiologists who don't. So, what is the path toward making AI a key enabler for medicine? AI-powered healthcare requires three key factors: sound data science, sharp focus and strategic deployment. And, it requires the patience to balance the excitement of advanced digital technology with the practical realities of how healthcare operates.
Barrier to AI in the Enterprise: Access to High Quality Data - AI Trends
According to a recent Teradata study, 80% of IT and business decision-makers have already implemented some form of artificial intelligence (AI) in their business. The study also found that companies have a desire to increase AI spending. Forty-two percent of respondents to the Teradata study said they thought there was more room for AI implementation across the business, and 30% said their organizations weren't investing enough in AI. Forrester recently released their 2018 Predictions and also found that firms have an interest in investing in AI. Fifty-one percent of their 2017 respondents said their firms were investing in AI, up from 40% in 2016, and 70% of respondents said their firms will have implemented AI within the next 12 months. While the interest in investing in and growing AI implementation is there, 91% of respondents to the Teradata survey said they expect to see barriers get in the way of investing in and implementing AI.
The NHS is a much bigger challenge for DeepMind than Go
People have a weird obsession with games like Chess and Go. Achievement in them has long been seen as a marker of human intellect, and yet they're among the least human tests you could devise: players are put in simplified situations where everything is known, every possible course of action is laid out for them, and the test is one of concentration and logic. We pass far greater tests daily, when we recognise a face in a crowd, when we dynamically balance in motion, when we predict the response our words and expressions will have on another sentient being, or when we do all of the above, effortlessly, at the same time. We don't think of these as challenging because they're so innately human, while playing Chess or Go seems far more impressive precisely because they're more rigid and computational in nature. There's an irony in making a board game one of the 'grand challenges' of AI, and it surprises me that more people don't see it.